今天是鐵人賽挑戰的第 30 天拉,來把分析 Kafka PR 的坑填完。
我們想要知道 魔鏡啊 魔鏡啊 誰是 Apache Kafka 最近 100隻 PR 裡面 reveiw 最多的人?
GitHub 對於匿名 api 的 rate limit 是一小時 50 次
因為我們會超過 rate limit 所以要設定 github 的 Bearer Token,這邊就請大家問 chatGPT 怎麼弄到自己帳號的 Bearer Token
CREATE SECRET github (
TYPE HTTP,
P {
'Authorization': 'Bearer <github_pat_ 開頭的 github token>'
}
);
得到最近 100 隻的 PR 資訊
CREATE TABLE kafka AS
FROM read_json_auto("https://api.github.com/repos/apache/kafka/pulls?state=closed&sort=updated&direction=desc&per_page=100");
這 100 隻 PR 分別對應的 comments api endpoint 跟 review_comments api endpoint
CREATE TABLE kafka_comments AS
SELECT number, _links.comments.href as comments , _links.review_comments.href as review_comments
FROM kafka;
把對應 api endpoint 存成一個 sql variable
SET VARIABLE normal_comments = (SELECT List("comments") FROM kafka_comments);
SET VARIABLE review_comments = (SELECT List("review_comments") FROM kafka_comments);
分別建立 comments 和 review_comments 的 table
**這部分需要 10 分鐘,因為我們打了超過 200 次 github api 並且還要 parse json 建成 table **
CREATE TABLE nc AS
SELECT * FROM read_json_auto(getvariable('normal_comments') , union_by_name = true );
CREATE TABLE rc AS
SELECT * FROM read_json_auto(getvariable('review_comments') , union_by_name = true );
得到每一隻 PR 有哪些人 review
CREATE TABLE combined_pr_reviewers (
pr TEXT PRIMARY KEY,
reviewers TEXT[]
);
WITH review_comments AS (
SELECT pull_request_url as pr,
array_agg(DISTINCT user.login) as reviewers
FROM rc
GROUP BY pr
),
issue_comments AS (
SELECT issue_url as pr,
array_agg(DISTINCT user.login) as reviewers
FROM nc
GROUP BY pr
)
INSERT INTO combined_pr_reviewers (pr, reviewers)
SELECT pr, reviewers
FROM (
SELECT pr, reviewers
FROM review_comments
UNION ALL
SELECT pr, reviewers
FROM issue_comments
) AS combined
ORDER BY pr;
WITH unnested_reviewers AS (
SELECT unnest(reviewers) AS reviewer
FROM combined_pr_reviewers
)
SELECT
reviewer,
COUNT(*) AS review_count
FROM unnested_reviewers
GROUP BY reviewer
ORDER BY review_count DESC
LIMIT 10;
reviewer | review_count |
---|---|
github-actions[bot] | 45 |
chia7712 | 33 |
asfbot | 25 |
ijuma | 11 |
ewencp | 9 |
apoorvmittal10 | 7 |
junrao | 6 |
mingyen066 | 6 |
mimaison | 5 |
AndrewJSchofield | 5 |
chia7712 是鬼吧 !
To see is to believe , 趕快到 https://shell.duckdb.org/ 驗證一下 chia7712 是不是 review 最多的人。
完成 30 天的鐵人賽拉,說實話比我想像中困難很多,謝謝我的隊友的鼓勵,沒有你們很難寫完這 30 天的 DuckDB 分析。
希望大家經過這 30 天有更了解 DuckDB 😍